[FLINK-40067][tests] Fix race in RescaleTimelineITCase.testRescaleTerminatedByJobFinished #28633

Open: MartijnVisser wants to merge 1 commit into apache:master from MartijnVisser:fix-rescale-timeline-jobfinished-race

@MartijnVisser

Contributor

What is the purpose of the change

Fixes a test-side timing race that makes RescaleTimelineITCase.testRescaleTerminatedByJobFinished flaky on loaded CI machines (Azure build 76571, leg test_cron_hadoop313_core). With the short 100ms cooldown shared by the parameterized fixture, the cooldown-driven Idling transition of DefaultStateTransitionManager terminates the in-progress rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE before the job finishes; goToFinished's later JOB_FINISHED stamp is then dropped (DefaultRescaleTimeline#updateRescale is a no-op once the rescale is terminated), so the awaited condition can never be met. The timeout only became observable after FLINK-40009 made the wait helper's budget real. Both terminal reasons are legitimate product behaviour; this is not a product bug.
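
The ordering problem described above can be illustrated with a minimal, self-contained sketch. The `RescaleSketch` class and its `terminate` method are hypothetical stand-ins written for this illustration, not Flink's `DefaultRescaleTimeline`; the only behaviour assumed from the description is that a terminal reason is stamped once and later updates are no-ops.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration only; these are NOT Flink's classes.
enum TerminalReason { NO_RESOURCES_OR_PARALLELISMS_CHANGE, JOB_FINISHED }

final class RescaleSketch {
    // null while the rescale is still in progress
    private TerminalReason terminalReason;

    /** Stamps a terminal reason; returns false (no-op) if one is already set. */
    synchronized boolean terminate(TerminalReason reason) {
        if (terminalReason != null) {
            return false; // already terminated: the later stamp is dropped
        }
        terminalReason = reason;
        return true;
    }

    synchronized Optional<TerminalReason> terminalReason() {
        return Optional.ofNullable(terminalReason);
    }
}

class RescaleRaceSketch {
    public static void main(String[] args) {
        RescaleSketch rescale = new RescaleSketch();
        // With a short cooldown, the idling transition fires first.
        rescale.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        // The job finishes afterwards, but its stamp no longer applies.
        boolean stamped = rescale.terminate(TerminalReason.JOB_FINISHED);
        // prints "false NO_RESOURCES_OR_PARALLELISMS_CHANGE"
        System.out.println(stamped + " " + rescale.terminalReason().get());
    }
}
```

In this model a test that waits for `JOB_FINISHED` can only succeed if the job-finished stamp arrives before any other terminal reason, which is why the fix targets the timer rather than the wait itself.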

Brief change log

  • Skip the disabled-history parameter up front so it does not pay for a cluster rebuild.
  • Rebuild the fixture cluster with a 60s executing cooldown/stabilization via the existing rebuildClusterWithExecutingTimeouts helper, keeping the in-progress rescale alive past the unblock-to-finish window (mirrors the sibling fixes FLINK-39903/FLINK-40010).
  • Widen the final wait budget for headroom on loaded CI legs.
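
For readers unfamiliar with budgeted waits, the last bullet can be pictured with a small polling helper. This is a sketch under assumptions: `WaitSketch.waitUntil` is a hypothetical name and signature, not the wait helper used by RescaleTimelineITCase, and the millisecond values in the usage note are illustrative only.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not the helper used in the Flink test.
final class WaitSketch {
    /**
     * Polls the condition until it holds or the budget elapses.
     * Returns true if the condition held within the budget, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, long budgetMillis, long pollMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false; // budget exhausted: report a timeout to the caller
            }
            try {
                // Sleep for the poll interval, but never past the deadline.
                Thread.sleep(Math.max(1L, Math.min(pollMillis, remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A call such as `WaitSketch.waitUntil(() -> jobFinishedStamped(), 60_000, 50)` (with `jobFinishedStamped` a hypothetical predicate) returns as soon as the condition holds, so a wider budget only costs time when the condition is never met.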

Verifying this change

This change is already covered by existing tests. Ran testRescaleTerminatedByJobFinished 4 times and the full RescaleTimelineITCase (30 tests run, 0 failures, 8 skipped). The method now completes in under 2 seconds, confirming that JOB_FINISHED is stamped while the rescale is still in progress rather than after waiting out any timer. The original failure needs a starved JobManager thread on a loaded CI machine and is not reproducible locally; the fix removes the dependency on that timer ordering entirely.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8, via Claude Code)

Generated-by: Claude Opus 4.8 (1M context)

…minatedByJobFinished

On a loaded machine the cooldown-driven Idling transition terminates the in-progress rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE before the job finishes, so goToFinished's later JOB_FINISHED stamp is dropped and the wait can never succeed. Widen the fixture cooldown to keep the rescale in-progress until the job finishes, mirroring FLINK-39903/FLINK-40010.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot

flinkbot commented Jul 3, 2026

Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure - re-run the last Azure build
